ABSTRACT

The practical application of stochastic approximation
methods requires a reliable means to stop the iterative process
when the estimate is close to the optimal value or when
further improvement of the estimate is doubtful. Conventional
ideas on stopping stochastic algorithms employ probabilistic
criteria based on the asymptotic distribution of the
stochastic approximation process, often with the parameters
of the distribution determined by sequential estimation.
Difficulties may arise when this approach is applied
to small (finite) samples. We propose a different approach
that uses the notion of an idealized process as a companion
to the stochastic approximation. This approach to stopping
stochastic approximation is discussed in the context of a
simple example, including some empirical results.
1.  INTRODUCTION

Introduced  by  Robbins  and  Monro  [11]  in  1951, 
stochastic  approximation  is  the  adaptation  of  iterative 
optimization  and  root-finding  methods  to  stochastic 
problems.  Robbins  and  Monro  proved  convergence  in 
probability  of  a  sequence  of  estimates  to  an  optimal 
value,  results  which  have  since  been  strengthened  to 
almost  sure  convergence.  Yet  these  results  only  hold 
asymptotically.  In  practical  applications  the  process 
must  have  a  stopping  condition,  and  some  statements 
on  when  to  stop  the  procedure  or  about  the  quality  of 
the  solution  obtained  must  be  made. 

The need for a stopping rule for stochastic approximation
was recognized by Kiefer and Wolfowitz [4].
Since that time this issue has been extensively studied.
Burkholder [1] proposed estimators for the asymptotic
distribution and derived sufficient conditions for those
estimators to converge almost surely. Chow and Robbins
[2] developed a method to sequentially determine
a bound on the mean of a continuous random variable
with unknown variance. Recognizing the applicability
of this procedure to the problem of stopping stochastic
approximation, they suggested the rule: stop as soon as
the length of the confidence interval based on asymptotic
normality of the sample means is smaller than
$2\delta$ for some $\delta > 0$, or, equivalently, stop as soon as
the estimated standard deviation of the sample mean
is sufficiently small.
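A minimal sketch of a fixed-width stopping rule of this kind, assuming independent scalar observations. The function name `fixed_width_stop`, the minimum sample size `n_min`, the cap `n_max`, and the use of the plain sample variance are illustrative assumptions and not the exact rule of [2].

```python
import math
import random
from statistics import NormalDist


def fixed_width_stop(sample, delta, alpha=0.05, n_min=10, n_max=1_000_000):
    """Sample until the normal-theory confidence interval is no longer than 2*delta.

    `sample` is a zero-argument callable returning one observation.
    Returns (n, mean): the stopping time and the sample mean at that time.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided normal quantile
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squared deviations
    while n < n_max:
        x = sample()
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
        if n >= n_min:
            std_of_mean = math.sqrt(m2 / (n - 1) / n)  # estimated sd of the sample mean
            if z * std_of_mean <= delta:
                break
    return n, mean


rng = random.Random(0)
n_stop, estimate = fixed_width_stop(lambda: rng.gauss(1.0, 2.0), delta=0.1)
print(n_stop, estimate)
```

The rule stops on the estimated standard deviation of the sample mean, so the stopping time is itself random; `n_min` guards against stopping on an unreliable early variance estimate.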

Since this initial work much of the effort in stopping
stochastic approximation has been on the estimation
of the parameters of the asymptotic distribution in order
to apply one of the above criteria. Sielken [12]
and Stroup and Braun [16] both improved on the pioneering
work of Burkholder, applying sequential estimation
to calculate estimators and develop confidence
intervals in one dimension. These results were later
extended to the multi-dimensional case [9, 17]. The
conditions that guarantee the asymptotic validity for
sequentially estimated parameters were established in
[3].

2.  PROBLEM  FORMULATION 

2.1  The  Stochastic  Approximation  Process 

We consider only the unconstrained case. Suppose
$\theta \in \mathbb{R}^p$ is a vector with components representing control
parameters. Let $Q(\theta, \omega)$ denote the observed response
as a function of $\theta$ and the stochastic effect $\omega$.
The function of interest is $L(\theta) = E[Q(\theta, \omega)]$, the expected
response at $\theta$. The objective is to find the value
$\theta^*$ that minimizes $L(\theta)$:

$$\theta^* = \arg\min_{\theta} L(\theta).$$

If the parameters $\theta$ are continuous and the solution
space can be assumed closed and convex, then the
problem lends itself to solution with a gradient-based
optimization method. We choose an initial estimate
$\hat\theta_0 = \theta_0$ and update it with the following scheme:

$$\hat\theta_{k+1} = \hat\theta_k - a_k G_k(\hat\theta_k, \omega), \tag{1}$$

where $G_k(\hat\theta_k, \omega) \in \mathbb{R}^p$ is some (noisy) input information
related to the gradient of the process being
studied [4, 11]. A general discussion of the stochastic
approximation method may be found in [15]. The
asymptotic properties of $\hat\theta_k$ are well known [6], and under
relatively mild conditions the sequence of iterates
generated by (1) converges to $\theta^*$ almost surely; see,
for example, [8] or [5]. Additionally, when properly
scaled, the distribution function of the iterates, which
we denote $F_k$, is asymptotically normal. We denote
the asymptotic distribution by $F^*$.
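The recursion (1) can be sketched in a few lines. The gain sequence $a_k = a/(k+1)$, the quadratic test loss, and the names `robbins_monro`, `noisy_grad`, and `target` are assumptions made for illustration only; they do not come from the paper.

```python
import random


def robbins_monro(noisy_grad, theta0, a, n_steps):
    """Iterate theta_{k+1} = theta_k - a_k * G_k(theta_k) with a_k = a / (k + 1)."""
    theta = list(theta0)
    for k in range(n_steps):
        a_k = a / (k + 1)  # assumed gain sequence
        g = noisy_grad(theta)  # one noisy gradient observation
        theta = [t - a_k * gi for t, gi in zip(theta, g)]
    return theta


rng = random.Random(0)
target = [3.0, -1.0]  # hypothetical minimizer of L(theta) = ||theta - target||^2


def noisy_grad(theta):
    # Exact gradient 2 * (theta - target) plus independent N(0, 1) noise.
    return [2.0 * (t - c) + rng.gauss(0.0, 1.0) for t, c in zip(theta, target)]


theta_hat = robbins_monro(noisy_grad, theta0=[0.0, 0.0], a=1.0, n_steps=5000)
print(theta_hat)
```

The loop runs for a fixed number of steps; deciding when it is safe to stop earlier is the question the remainder of the paper addresses.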

2.2  Stopping  the  Approximation  Process 

Since the estimates $\hat\theta_k$ are random, the preferred approach
is to consider the probability that some tolerance
conditions are met. Ideally we prefer conditions
of the form

$$P_{F_k}\left(\|\hat\theta_k - \theta^*\| < \delta\right) \ge 1 - \alpha \tag{2a}$$

$$P_{F_k}\left(|L(\hat\theta_k) - L(\theta^*)| < \delta\right) \ge 1 - \alpha \tag{2b}$$

(see also [9, 10]). In a practical sense these conditions
would allow us to compute the $1 - \alpha$ confidence ellipse
about $\hat\theta_k$. Given an $\alpha \in [0, 1]$ and $\delta > 0$, we
stop at time $K(\alpha, \delta)$ equal to the smallest $k$ such that
either condition in (2) is true. To formalize the problem,
a customary approach is to define $B(\hat\theta_k, \delta)$, a ball
of radius $\delta$ about $\hat\theta_k$, and let $K$ be the first time the
confidence ellipse based on (2) is contained within the
ball.

Unfortunately, direct calculation of the probabilities
in (2) is not possible since the distribution of $\hat\theta_k$ (even
the distributional form) is generally not known. Some
manner of estimation of the distribution function $F_k$
is essential, and if convergence in distribution is sufficiently
fast, one solution is to use $F^*$ in lieu of $F_k$
and estimate the parameters of $F^*$. Since the form
of the distribution $F^*$ is known, this is a well-defined
problem.
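As a worked illustration of using $F^*$ in lieu of $F_k$, the sketch below computes the stopping time implied by a condition of the form (2a) in the scalar case. It assumes a gain of $a/k$, a gradient slope $h$ at $\theta^*$, and noise standard deviation $\sigma$, all treated as known, and it relies on the classical normal limit $\sqrt{k}(\hat\theta_k - \theta^*) \to N(0, a^2\sigma^2/(2ah - 1))$, valid for $ah > 1/2$. The function name and its arguments are hypothetical.

```python
import math
from statistics import NormalDist


def asymptotic_stopping_time(a, h, sigma, delta, alpha):
    """Smallest k with P(|theta_k - theta*| < delta) >= 1 - alpha under the normal limit.

    Scalar Robbins-Monro with gain a/k, gradient slope h at theta*, noise sd sigma.
    """
    if a * h <= 0.5:
        raise ValueError("the normal limit used here needs a * h > 1/2")
    s = a * sigma / math.sqrt(2.0 * a * h - 1.0)  # asymptotic sd of sqrt(k) * error
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided normal quantile
    return math.ceil((z * s / delta) ** 2)


print(asymptotic_stopping_time(a=1.0, h=1.0, sigma=1.0, delta=0.1, alpha=0.05))
```

In practice $h$ and $\sigma$ are unknown and must be estimated, and the answer is only as good as the normal approximation at the resulting $k$.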

The method requires knowledge of the covariance
matrix $\Sigma$ of the asymptotic normal distribution, and
the Hessian at the optimal point, $H(\theta^*)$, of the underlying
function. These matrices are not commonly
available, so the usual procedure is to estimate them
sequentially as part of the iteration. Given initial estimates
for $\Sigma$ and $H(\theta^*)$, these estimates are then
updated at each step (or perhaps every $m$ steps) as
the iterative process proceeds, and the estimators are
asymptotically correct.

This approach is less satisfactory for finite-sample
(non-asymptotic) stochastic approximation. The difficulties
extend beyond the fact that there may not be
sufficiently many iterations to compute reliable estimates
for $\Sigma$ and $H(\theta^*)$. Finite-sample behavior could
differ significantly from asymptotic behavior. This is
a difficulty with the basic assumption that $F^*$ can replace
$F_k$ in the calculation of the confidence ellipses.
The assumption is a good one only asymptotically.

3.  IDEALIZED  PROCESSES 

The concept of idealized processes is to develop a parameterized
companion to the original process whose
properties are known when the parameter is zero
(giving the idealized process), and whose
behavior reflects that of the original when the parameter
is some positive number. The expectation is that conclusions we
draw about the idealized process can be related to the
original process in some way determined by the parameter.

This is a relatively new idea, and the theoretical
justification for such a procedure is incomplete. The
applicability of this method, however, has been shown
for parameter estimation in maximum likelihood estimation
problems, among others [13, 14]. We intend to
demonstrate by example that this method can be used
to stop stochastic approximation processes as well.

The idea of using idealized processes for parameter
estimation was advanced by Spall [13, 14]. Spall's formulation
sought an estimate $\hat\theta$ for a parameter vector
$\theta$ from a set of data whose distribution depended on
$\theta$ and a known scalar $\varepsilon$. When the sample is small,
it is difficult to say much about the probabilities of $\hat\theta$
because the distributions are unknown. One approach
is to construct a parameterized process producing statistically
similar data and resulting in an estimate $\tilde\theta$
where the probabilities of $\tilde\theta$ are known, and then look
for conditions where the probabilities of $\hat\theta$ are close to
those of $\tilde\theta$ irrespective of the sample size.

We apply the same principle, but to the sequence
of iterates from a stochastic approximation process.
The idea is to establish conditions under which the
probabilities of $\hat\theta_k$ are close to those of $\tilde\theta_k$, which are
known. The resulting information is used to decide
whether to stop the process or, if stopped, to determine
the probability of being close to $\theta^*$.

The stochastic approximation process in its most
general form is

$$\hat\theta_{k+1} = T_k(\hat\theta_k, \omega), \tag{3a}$$

where $T_k$ is some transformation process and $\omega$ denotes
the random component which manifests itself as
noise in the measurements of the loss function or its
gradient. In the case of equation (1), $T_k$ is a nonlinear
operator with $T_k(\hat\theta_k, \omega) = \hat\theta_k - a_k G_k(\hat\theta_k, \omega)$, where
$G_k(\hat\theta_k, \omega)$ is a noisy estimate of the gradient of $L$ at $\hat\theta_k$.
We parameterize the transformation with a scalar $\eta$:


$$\theta_{k+1} = T_k(\theta_k, \omega; \eta). \tag{3b}$$

For some $\eta = \eta_0 > 0$ the sequence of iterates produced
by (3b) is the same as that produced by (3a), where by
"the same" we mean statistically indistinguishable. For $\eta = 0$,
(3b) produces a sequence of idealized iterates in the sense that the
probabilities for the iterates are known for each $k$. For
convenience we denote the sequence of iterates from
the idealized process by $\tilde\theta_k$, and the distribution function
of these iterates by $G_k$. (Note: when there is no
confusion, we will often drop the $\omega$ from the expression
for $T_k$.)

The idealized process $T_k(\theta; 0)$ could represent a simplified
process that converges to the same $\theta^*$, but more
frequently the idealized process merely mimics some of
the asymptotic properties of the true process. It cannot
generally be shown (nor is it necessary to show)
that $\tilde\theta_k \to \theta^*$. It is only necessary that the parameterized
transformation process generate a sequence of
dependent observations with distributional properties
(other than location) that are similar to those of $\hat\theta_k$.

In general, it is not easy to determine a suitable
parameterization for a general process. In the example
that follows we assume the parameterization is known
for analytical purposes, but further work is necessary
before a systematic method of parameterization can
be presented.

The justification for this approach is found in the
use of the Skorokhod representation theorem to map
the original process into another process on a different
space where an analysis of the properties of the process
is easier.

The general approach is to simplify the transformation
process to generate an idealized process that is
nearly identical up to some order. If the differences
are small enough to be ignored, then the probability
distributions should be close as well.

Formalizing this argument gives the following theorem:
Let $\hat\theta_k = T_k(\hat\theta_{k-1}; \eta_0)$ be a stochastic approximation
process with mean sequence $\bar\theta_k$ and covariance
process $\Sigma_k$. Let $\tilde\theta_k = T_k(\tilde\theta_{k-1}; 0)$ be a linear idealized
process relative to $\hat\theta_k$ (linearized about $\theta_0$), and
let $\tilde\theta_k$ have covariance process $\tilde\Sigma_k$. Assume the conditions
required for the convergence $\hat\theta_k \to \theta^*$ hold.
Let $S(\theta)$ be any symmetric region about the point $\theta$.
Let $F_k$ be the true distribution of $\hat\theta_k$, and let
$G_k$ be the idealized distribution of $\tilde\theta_k$. Then

$$P_{F_k}\left(\hat\theta_k \in S(\theta^*)\right) - P_{G_k}\left(\tilde\theta_k \in S(\theta^*)\right) \to 0 \quad \text{as } k \to \infty. \tag{4}$$

4.  EXAMPLE


We take as an example the idealized process formed
by linearizing the gradient. This results in an autoregressive
process whose properties can be determined
analytically. Consider the process given by (1) where
$G_k(\theta, \omega) = g(\theta) + e_k(\omega)$ is the noisy gradient of $L(\theta)$.
Here $T_k(\theta, \omega) = \theta - a_k g(\theta) - a_k e_k(\omega)$. The first-order
Taylor expansion of the gradient about a point $\theta_0$ is

$$g(\theta) = g(\theta_0) + H(\theta_0)(\theta - \theta_0) + O(\|\theta - \theta_0\|^2 \mathbf{1}_p).$$

The notation $H(\theta_0) = \nabla g(\theta_0)$ is shorthand for
$\nabla g(\theta)|_{\theta = \theta_0}$, the Jacobian of $g$ evaluated at $\theta_0$. Since
$g$ is the gradient of $L$, the Jacobian of $g$ is the Hessian
of $L$, and we use the function $H$ to denote this. The
natural parameterization is

$$T_k(\theta; \eta) = \theta - a_k g(\theta_0) - a_k H(\theta_0)(\theta - \theta_0) + \eta\, O(\|\theta - \theta_0\|^2 \mathbf{1}_p) - a_k e_k. \tag{5}$$

When $\eta = \eta_0$ (for some $\eta_0$) we have the true process
(1) with $G_k(\theta, \omega) = g(\theta) + e_k(\omega)$. When $\eta = 0$ we have
the approximation process

$$\tilde\theta_{k+1} = \tilde\theta_k - a_k g(\theta_0) - a_k H(\theta_0)(\tilde\theta_k - \theta_0) - a_k e_k. \tag{6}$$

Since $T_k(\tilde\theta_k; 0)$ is linear in the random components,
it follows from repeated substitution that each iterate
$\tilde\theta_{k+1}$ in (6) may be expressed as a deterministic variable
(depending on $k$) plus the sum of multiples of the
random variables $\{e_0, e_1, \ldots, e_k\}$. If we can assume
nice behavior of the $e_k$ (for example, independence and
normality), then the distribution function $G_k$ of $\tilde\theta_k$ is
known for any $k$. This distribution is used to
compute $K(\alpha, \delta)$ for some $\alpha$ and $\delta$ according to (2).
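Because (6) is linear, its mean and covariance obey simple recursions: with $A_k = I - a_k H(\theta_0)$, the mean satisfies $m_{k+1} = m_k - a_k[g(\theta_0) + H(\theta_0)(m_k - \theta_0)]$ and the covariance satisfies $C_{k+1} = A_k C_k A_k^T + a_k^2 \sigma^2 I$. The sketch below implements these recursions under assumptions that are not in the source: gains $a_k = a/(k+1)$, independent noise with covariance $\sigma^2 I$, and a deterministic starting point. The function name `idealized_moments` and the numerical inputs are hypothetical.

```python
def idealized_moments(theta0, g0, H, a, sigma, n_steps, start=None):
    """Exact mean and covariance of the linear recursion (6) after n_steps.

    Assumes gains a_k = a / (k + 1) and i.i.d. noise with covariance sigma^2 * I;
    the iterate starts at `start` (default: the linearization point theta0).
    """
    p = len(theta0)
    mean = list(theta0 if start is None else start)
    cov = [[0.0] * p for _ in range(p)]  # the starting point is deterministic
    for k in range(n_steps):
        ak = a / (k + 1)
        # A = I - a_k H propagates deviations from one iterate to the next.
        A = [[(1.0 if i == j else 0.0) - ak * H[i][j] for j in range(p)]
             for i in range(p)]
        dev = [mean[j] - theta0[j] for j in range(p)]
        mean = [mean[i] - ak * (g0[i] + sum(H[i][j] * dev[j] for j in range(p)))
                for i in range(p)]
        AC = [[sum(A[i][l] * cov[l][j] for l in range(p)) for j in range(p)]
              for i in range(p)]
        # cov <- A cov A^T + a_k^2 sigma^2 I
        cov = [[sum(AC[i][l] * A[j][l] for l in range(p))
                + (ak * sigma) ** 2 * (i == j) for j in range(p)]
               for i in range(p)]
    return mean, cov


# Hypothetical inputs: zero gradient at the linearization point.
H = [[4.0, 4.0], [4.0, 10.0]]
mean, cov = idealized_moments([1.0, 1.0], [0.0, 0.0], H, a=0.5, sigma=1.0, n_steps=1000)
print(mean, cov)
```

With normal noise the idealized iterate is itself normal with this mean and covariance, so probabilities of the form (2a) can be evaluated for any finite $k$ rather than only in the limit.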

We illustrate this idea by using a function from the
Moré et al. [7] suite of optimization problems, the
so-called variably dimensioned function in two dimensions,
to generate stochastic data and to identify the
linear idealized process as a test of the procedure. Let
$\theta = [t_1 \; t_2]^T \in \mathbb{R}^2$ and $L_{VD} : \mathbb{R}^2 \to \mathbb{R}$. Then the
variably dimensioned function is defined as


$$L_{VD}(\theta) = (t_1 - 1)^2 + (t_2 - 1)^2 + (t_1 + 2t_2 - 3)^2 \left(1 + (t_1 + 2t_2 - 3)^2\right).$$

The gradient is

$$g_{VD}(\theta) = \begin{bmatrix} 4t_1 + 4t_2 + 4(t_1 + 2t_2 - 3)^3 - 8 \\ 4t_1 + 10t_2 + 8(t_1 + 2t_2 - 3)^3 - 14 \end{bmatrix}.$$

This function satisfies the conditions for convergence
of (1), and there is a unique global minimum located
at $\theta^* = [1 \; 1]^T$.
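The loss and gradient above translate directly into code. The sketch below also checks the gradient against a central finite difference; the helper `central_diff` and the test point are illustrative choices, not part of the paper.

```python
def loss_vd(theta):
    """Variably dimensioned function in two dimensions."""
    t1, t2 = theta
    u = t1 + 2.0 * t2 - 3.0
    return (t1 - 1.0) ** 2 + (t2 - 1.0) ** 2 + u ** 2 * (1.0 + u ** 2)


def grad_vd(theta):
    """Analytic gradient of loss_vd."""
    t1, t2 = theta
    u3 = (t1 + 2.0 * t2 - 3.0) ** 3
    return [4.0 * t1 + 4.0 * t2 + 4.0 * u3 - 8.0,
            4.0 * t1 + 10.0 * t2 + 8.0 * u3 - 14.0]


def central_diff(f, theta, h=1e-6):
    """Central finite-difference approximation to the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


point = [0.3, -0.7]  # arbitrary test point
print(grad_vd(point), central_diff(loss_vd, point))
print(grad_vd([1.0, 1.0]))  # the gradient vanishes at the minimizer
```

Differentiating the gradient once more at $\theta^* = [1 \; 1]^T$, where the cubic terms vanish, gives the Hessian $H(\theta^*)$ with rows $[4 \; 4]$ and $[4 \; 10]$.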

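Suppose the form of the loss function $L_{VD}$ is not
known, but we are able to provide inputs $\theta$ and observe
the noisy gradient. We assume the components of
the noise are independent and normally distributed
with mean zero and variance $\sigma_k^2$. Then we model
the process as follows: for a sequence of inputs $\{\theta_k\}$
we have a sequence of observations $\{Y_k\}$ generated by
$Y_k(\theta_k, \omega) = g_{VD}(\theta_k) + e_k(\omega)$. Let $\hat\theta_k = [\hat t_1 \; \hat t_2]^T$ and
$\tilde\theta_k = [\tilde t_1 \; \tilde t_2]^T$. Using Robbins-Monro iteration the true
($\eta = \eta_0$) process is

$$\hat\theta_{k+1} = \hat\theta_k - a_k Y_k(\hat\theta_k, \omega) = \begin{bmatrix} \hat t_1 \\ \hat t_2 \end{bmatrix} - a_k \begin{bmatrix} 4\hat t_1 + 4\hat t_2 + 4(\hat t_1 + 2\hat t_2 - 3)^3 - 8 \\ 4\hat t_1 + 10\hat t_2 + 8(\hat t_1 + 2\hat t_2 - 3)^3 - 14 \end{bmatrix} - a_k e_k$$

and the idealized ($\eta = 0$) process, linearized about $\theta_0 = \theta^* = [1 \; 1]^T$, is

$$\tilde\theta_{k+1} = \begin{bmatrix} \tilde t_1 \\ \tilde t_2 \end{bmatrix} - a_k \begin{bmatrix} 4\tilde t_1 + 4\tilde t_2 - 8 \\ 4\tilde t_1 + 10\tilde t_2 - 14 \end{bmatrix} - a_k e_k.$$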

The distribution function $F_k$ is unknown, and impossible
to calculate. However, it can be approximated
using a Monte Carlo experiment. For this example, the
distribution function for the idealized process, $G_k$, is
that of a sum of bivariate normal random variables, and can
be calculated exactly. The Robbins-Monro algorithm
with step size $a_k = a/k$ was stopped after $K$ steps using
$\theta_0 = \theta^*$ and ak = 10. The scatterplots in Figure 1
were generated by taking 100,000 such runs.
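A small-scale sketch of such a Monte Carlo comparison follows. It drives the true and the idealized recursions with the same noise draws so that the gap between them can be read off directly. The gains $a/(k+1)$ with $a = 0.3$, the noise level $\sigma = 0.1$, the 200 runs of 50 steps, and the use of common random numbers are all assumptions chosen to keep the example fast and stable; they are not the settings of the experiment reported here.

```python
import random


def grad_vd(theta):
    """Gradient of the two-dimensional variably dimensioned function."""
    t1, t2 = theta
    u3 = (t1 + 2.0 * t2 - 3.0) ** 3
    return [4.0 * t1 + 4.0 * t2 + 4.0 * u3 - 8.0,
            4.0 * t1 + 10.0 * t2 + 8.0 * u3 - 14.0]


def simulate_pair(rng, a, sigma, n_steps, theta_star=(1.0, 1.0)):
    """Run the true and the idealized recursion from theta* with the same noise."""
    true_it = list(theta_star)
    ideal_it = list(theta_star)
    for k in range(n_steps):
        ak = a / (k + 1)  # assumed gain sequence
        e = [rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)]
        g = grad_vd(true_it)
        true_it = [true_it[i] - ak * (g[i] + e[i]) for i in range(2)]
        # Gradient linearized about theta* = [1, 1]: H (theta - theta*).
        t1, t2 = ideal_it
        lin = [4.0 * t1 + 4.0 * t2 - 8.0, 4.0 * t1 + 10.0 * t2 - 14.0]
        ideal_it = [ideal_it[i] - ak * (lin[i] + e[i]) for i in range(2)]
    return true_it, ideal_it


rng = random.Random(0)
pairs = [simulate_pair(rng, a=0.3, sigma=0.1, n_steps=50) for _ in range(200)]
max_gap = max(max(abs(x - y) for x, y in zip(t, i)) for t, i in pairs)
print(max_gap)
```

Collecting the final iterates of each process over many runs gives the point clouds whose spread can be compared with the exact idealized distribution and with the asymptotic one.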

It is apparent from the plots that $G_k$ is a better
approximation to $F_k$ than $F^*$, even for high iteration
counts. One measure of how well $G_k$ approximates $F_k$
when $\theta_0 = \theta^*$ is to look at the Kullback-Leibler distance
between the sample probability mass function
and values of the distribution of the idealized process.
In this example, computations show that the Kullback-Leibler
distance is actually decreasing with increasing
iterations. This is mostly an artifact of choosing
to expand about $\theta^*$. An expansion about the initial
point $\theta_0$, say, leads to an approximation that is initially
good, but gets worse as the iteration progresses.
In this instance one might "restart" the process by relinearizing
the gradient (at the current point, say) and
continuing the process from there.
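One way to compute a distance of this kind is to bin the samples and compare the sample mass in each bin with the mass the reference distribution assigns to it. The sketch below does this in one dimension with a normal reference; the binning scheme, the function name `kl_to_reference`, and the synthetic data are assumptions, and the bivariate comparison described above would bin in two dimensions instead.

```python
import bisect
import math
import random
from statistics import NormalDist


def kl_to_reference(samples, ref, edges):
    """Discrete Kullback-Leibler distance sum_i p_i log(p_i / q_i).

    p_i is the sample probability mass in bin i and q_i is the mass that the
    reference distribution `ref` assigns to the same bin. The bins are
    (-inf, edges[0]], (edges[0], edges[1]], ..., (edges[-1], inf).
    """
    counts = [0] * (len(edges) + 1)
    for x in samples:
        counts[bisect.bisect_left(edges, x)] += 1
    cdf = [0.0] + [ref.cdf(e) for e in edges] + [1.0]
    n = len(samples)
    kl = 0.0
    for i, c in enumerate(counts):
        if c > 0:  # empty bins contribute nothing
            p = c / n
            q = cdf[i + 1] - cdf[i]
            kl += p * math.log(p / q)
    return kl


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
edges = [-4.0 + 0.5 * i for i in range(17)]  # bin edges from -4 to 4
kl_same = kl_to_reference(samples, NormalDist(0.0, 1.0), edges)
kl_shift = kl_to_reference(samples, NormalDist(0.5, 1.0), edges)
print(kl_same, kl_shift)
```

A small value indicates that the reference distribution describes the samples well at the chosen bin resolution; the value depends on the bins, so comparisons should use the same edges throughout.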
(b) $K$ = 10,000.


Figure 1: The figures above show a scatterplot of $\hat\theta_K$
overlaid with the approximate 95th percentile ellipse
of the distribution (solid ellipse) for the value of $K$
shown when $\theta_0 = \theta^*$. The 95th percentile ellipse for $\tilde\theta_K$
is indistinguishable from the solid ellipse in this scale,
and therefore is not shown. The dashed ellipse represents
the 95th percentile of the asymptotic distribution
of $\hat\theta_K$.


5.  CONCLUSIONS 

There currently is no systematic way to select the
idealized process or to evaluate alternatives. The problem
with the linearized transformation is that information
on the Hessian is needed, and in a practical application
it must be estimated from the observations.
This was a drawback of the sequential estimation approach
in small finite samples, and with a linearized
transformation we cannot escape it here (though it
may be better to use this information to estimate $G_k$
rather than $F^*$). Other autoregressive processes show
potential.

Additional advantages to this approach include relative
computational efficiency compared to other methods
and the potential to take into account in an explicit
manner the impact of a poorly selected $\theta_0$. The
questions that remain point out the need for a more
general theory for idealized processes.